NEURAL NETWORKS: WHAT NON-LINEARITY TO CHOOSE?

Vladik Kreinovich, Chris Quintana* 

Abstract. Neural networks are now one of the most successful learning formalisms. [...]
best for some problems, and some new functions that may be worth trying.

1. FORMULATION OF THE PROBLEM. 

Neural networks are now one of the most successful learning formalisms (see, e.g., the recent
survey in [Hecht-Nielsen 1991]). After the initial success of linear neural models, in which the
output is equal to the [...]

[...] arises: what function f(y) do we choose?

[...] optimization will lead to even better results.

Why is this problem [...] the commonly used logistic function (see below).

[...] which f(y) is optimal if we cannot compute [...] even for a single f. There does not seem
to be a likely answer.

However, we will show that this problem is solvable (and give the solution) using advanced
math, namely, group theory. (For a general idea of this approach, see [Kreinovich 1990].)

2. WHAT IS KNOWN. 

[...] the properties of [...]

Grossberg (see, e.g., [...]) [...] properties useful for learning and is therefore [...]
functions with the same properties, [...] class of possible functions, but there are still many
[...]. So we still have to make a choice. Another attack was undertaken by Hecht-Nielsen, who in
[1987] added a demand that any function must be approximated by some neural network. This
approach leads to non-trivial mathematics, but the final result (see, e.g., [Kreinovich 1991]) is
that any smooth function f will work, so this additional demand does not help us to choose f.

* Computer Science Department, University of Texas at El Paso, El Paso, TX 79968, USA

3. WHAT DO WE PROPOSE 

3.1. Motivations of the proposed mathematical definitions. 

We must choose a family of functions, not a single function. We speak about
choosing f, but the expression for f(y) will change if we change the units in which we measure
all the signals (input, output and intermediate), so in mathematical terms, it is better to speak
about choosing a family of functions f. It is reasonable to suggest that if a function f belongs to
this family, then this family must contain kf for all positive real numbers k. This corresponds to
changing units. Also, it must contain f + c, where c is a constant. This is equivalent to adding
a constant bias and therefore does not change the abilities of the resulting network. Since we
are talking about non-linear phenomena, we can also assume that some non-linear "rescaling"
transformations x → g(x) are also applicable, i.e., the family must include the composition
g(f(y)) for each of its functions f. This family must not be too big; therefore, it must be determined
by finitely many parameters and should ideally be obtained from one function f(y) by applying all
these transformations. Without loss of generality, we can assume that this set of transformations
is closed under composition and under inverse, i.e., if z → g_1(z) and z → g_2(z) are possible
transformations, then z → g_1(g_2(z)) and z → g_1^{-1}(z) are possible transformations, where by g^{-1}
we denote the inverse function: g_1^{-1}(z) = w if and only if g_1(w) = z. In mathematical terms this
means that these transformations form a group, and therefore a family is obtained by applying
to some function f(y) all transformations from some finite-dimensional transformation group G
that includes all linear transformations (and maybe some non-linear ones).

All these transformations correspond to appropriate "rescalings". Rescaling is something
that smoothly changes the initial scale. This means that if we have two different transformations,
there must be a smooth transition between them. In mathematical terms, the existence of
this continuous transition is expressed by saying that the group is connected, and the fact that
both the transformations and the transitions are smooth is expressed by saying that this is a Lie
group (see, e.g., [Chevalley 1946]).

What family is the best? Among all such families, we want to choose the best one. In
formalizing what "the best" means we follow the general idea outlined in [Kreinovich 1990]. The
criteria to choose may be computational simplicity, efficiency of training, or something else. In
mathematical optimization problems, numeric criteria are most frequently used, when to every
family we assign some value expressing its performance, and choose a family for which this value
is maximal. However, it is not necessary to restrict ourselves to such numeric criteria only. For
example, if we have several different families that have the same training ability A, we can choose
between them the one that has the minimal computational complexity C. In this case, the actual
criterion that we use to compare two families is not numeric, but more complicated: a family
F_1 is better than the family F_2 if and only if either A(F_1) > A(F_2), or A(F_1) = A(F_2) and
C(F_1) < C(F_2). A criterion can be even more complicated. What a criterion must do is to allow
us for every pair of families to tell whether the first family is better with respect to this criterion
(we'll denote it by F_1 > F_2), or the second is better (F_1 < F_2), or these families have the same
quality in the sense of this criterion (we'll denote it by F_1 ~ F_2). Of course, it is necessary to
demand that these choices be consistent, e.g., if F_1 > F_2 and F_2 > F_3 then F_1 > F_3.
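The non-numeric criterion described above (compare training ability first, then break ties by computational complexity) can be sketched as follows. This is only an illustration: the `Family` record, its field names and the sample values are hypothetical and do not come from the paper.

```python
from typing import NamedTuple


class Family(NamedTuple):
    """Hypothetical summary of a family of functions (illustration only)."""
    name: str
    A: float  # training ability: larger is better
    C: float  # computational complexity: smaller is better


def compare(f1: Family, f2: Family) -> str:
    """Return '>' if f1 is better, '<' if f2 is better, '~' if equivalent."""
    if f1.A != f2.A:
        return ">" if f1.A > f2.A else "<"
    if f1.C != f2.C:
        return ">" if f1.C < f2.C else "<"
    return "~"


# Made-up sample values, chosen only to exercise each branch.
F1 = Family("F1", A=0.9, C=10.0)
F2 = Family("F2", A=0.9, C=12.0)
F3 = Family("F3", A=0.8, C=1.0)

print(compare(F1, F2))  # same A, so the smaller C decides
print(compare(F2, F3))  # larger A decides, regardless of C
print(compare(F1, F1))  # identical scores are equivalent
```

The three possible answers correspond to the relations F_1 > F_2, F_1 < F_2 and F_1 ~ F_2 used in the text.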

Another natural demand is that this criterion must choose a unique optimal family (i.e.,
a family that is better with respect to this criterion than any other family). The reason for
this demand is very simple. If a criterion does not choose any family at all, then it is of no
use. If several different families are "the best" according to this criterion, then we still have a
problem to choose among those "best". Therefore, we need some additional criterion for that
choice. For example, if several families turn out to have the same training ability, we can choose
among them a family with minimal computational complexity. So what we actually do in this
case is abandon that criterion for which there were several "best" families, and consider a new
"composite" criterion instead: F_1 is better than F_2 according to this new criterion if either it was
better according to the old criterion, or according to the old criterion they had the same quality
and F_1 is better than F_2 according to the additional criterion. In other words, if a criterion does
not allow us to choose a unique best family, it means that this criterion is not ultimate; we have
to modify it until we come to a final criterion that will have that property.

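The next natural condition that the criterion must satisfy is connected with [...].
Suppose that instead of a neuron with the function f(y) we consider a new neuron with the
function f(y + a) for some constant a. The new neuron can be easily simulated by the
old one: namely, the output of this new neuron is f(y + a),
so it is equivalent to an old neuron with an additional constant input a. Likewise, the old neuron
is equivalent to the new neuron with an additional constant input -a. Therefore, the networks
that are formed by these new neurons have the same abilities as those that are built
from the old ones [...] same quality as the old ones,
because adding [...]. So if the family {f(y)} is better than the family {g(y)}, then the family
{f(y + a)} must be still better than the family {g(y + a)}.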

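3.2. Definitions and the main result.

Definitions. By an appropriate transformation group we mean a connected finite-dimensional
Lie group of transformations of the set of real numbers R onto itself that contains all linear
transformations. By a family of functions we mean the set of functions that is obtained from a
smooth non-constant function f(y) by applying all the transformations from some appropriate
transformation group G. Let us denote the set of all the families by F.

A pair of relations (<, ~) is called consistent [Kreinovich 1990] if it satisfies the following
conditions: (1) if a < b and b < c then a < c; (2) a ~ a; (3) if a ~ b then b ~ a; (4) if a ~ b and
b ~ c then a ~ c; (5) if a < b and b ~ c then a < c; (6) if a ~ b and b < c
then a < c; (7) if a < b then b < a or a ~ b are impossible.

Assume a set A is given. Its elements will be called alternatives. By an optimality criterion
we mean a consistent pair (<, ~) of relations on the set A of all alternatives. If a > b, we say
that a is better than b; if a ~ b, we say that the alternatives a and b are equivalent with respect
to this criterion. We say that an alternative a is optimal (or best) with respect to a criterion
(<, ~) if for every other alternative b either a > b or a ~ b.

We say that a criterion is final if there exists an optimal alternative, and this optimal
alternative is unique.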

Comment. In the present section we consider optimality criteria on the set F of all families. 

Definition. By the result of adding a to a function f(y) we mean a function f_a(y) = f(y + a).
By the result of adding a to a family F we mean the set of the functions that are obtained from
f ∈ F by adding a. This result will be denoted by F + a. We say that an optimality criterion on F
is shift-invariant if for every two families F and G and for every number a, the following two
conditions are true:

i) if F is better than G in the sense of this criterion (i.e., F > G), then F + a > G + a;

ii) if F is equivalent to G in the sense of this criterion (i.e., F ~ G), then F + a ~ G + a.

Comment. As we have already remarked, [...] at first glance they may [...]

By a logistic function we mean s_0(y) = 1/(1 + exp(-y)).

THEOREM 1. If a family F is optimal in the sense of some optimality criterion that is final
and shift-invariant, then every function f from F is equal either to a + b s_0(Ky + l) for some
a, b, K and l, or to a linear function a + bx, or to a + b exp(Kx) for some a, b, K.

Comments. 1. Logistic, hyperbolic-tangent, linear and exponential functions are really among
the most popular [Kosko 1992].

2. We assumed that f must be smooth. If we consider f that may be non-smooth at some
points, then it is natural to assume that on the intervals on which f is smooth, it must coincide
with one of these functions. Such piecewise smooth functions have also been successfully used;
the most popular are threshold functions that are obtained from the smooth ones by restricting
their values to [0, ∞) or [0, 1] [Kosko 1992].

(The proofs are given in Section 5). 
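As a numerical illustration of why the logistic function fits this setting, the sketch below checks that shifting its argument, s_0(y + a), gives the same result as applying a fractionally linear map to s_0(y), namely s → s/(s + exp(-a)(1 - s)). It also checks that the hyperbolic tangent is a member of the same family, tanh(y) = 2 s_0(2y) - 1. The helper names and the grid of test points are assumptions made for this example only.

```python
import math


def s0(y: float) -> float:
    """Logistic function s0(y) = 1 / (1 + exp(-y))."""
    return 1.0 / (1.0 + math.exp(-y))


def shifted_via_fractional_linear(s: float, a: float) -> float:
    """Apply the fractionally linear map s -> s / (s + exp(-a) * (1 - s))."""
    return s / (s + math.exp(-a) * (1.0 - s))


def max_shift_error(a: float, ys) -> float:
    """Largest gap between s0(y + a) and the fractionally linear image of s0(y)."""
    return max(abs(s0(y + a) - shifted_via_fractional_linear(s0(y), a)) for y in ys)


def max_tanh_error(ys) -> float:
    """Largest gap between tanh(y) and 2 * s0(2y) - 1."""
    return max(abs(math.tanh(y) - (2.0 * s0(2.0 * y) - 1.0)) for y in ys)


ys = [x / 10.0 for x in range(-50, 51)]  # arbitrary grid on [-5, 5]
for a in (-2.0, 0.5, 3.0):
    print(a, max_shift_error(a, ys))
print(max_tanh_error(ys))
```

Both errors stay at the level of floating-point rounding, which is what one would expect if the identities hold exactly.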

4. OPTIMIZATION OF NEURAL NETWORKS: RELATED RESULTS 

4.1. Scale-invariance instead of shift-invariance. In the above text we assumed that the
optimality criterion is shift-invariant. The same arguments can be used to motivate the demand
that the optimality criterion is invariant with respect to scaling transformations f(y) → f(ay)
for some a > 0. Let us analyze the consequences of this demand.

Definition. By the result of rescaling a function f(y) by a real number a > 0 we mean
a function f_a(y) = f(ay). By the result of rescaling a family F by a we mean the set of the
functions that are obtained from rescaling f ∈ F by a. This result will be denoted by aF. We
say that an optimality criterion on F is scale-invariant if for every two families F and G and for
every number a > 0 the following two conditions are true:

i) if F is better than G in the sense of this criterion (i.e., F > G), then aF > aG;

ii) if F is equivalent to G in the sense of this criterion (i.e., F ~ G), then aF ~ aG.

THEOREM 2. If a family F is optimal in the sense of some optimality criterion that is final
and scale-invariant, then every function f from F is equal to f(y) = (A + B y^{-α})/(C + D y^{-α})
for some A, B, C, D and α > 0.

Comments. 1. In particular, for A = 0, B = C = D = 1 and α = 2 we get the Cauchy function
f(y) = 1/(1 + y^2), that is used in neural networks (see, e.g., [Hecht-Nielsen 1991]). If B = 0 and
α is an integer, α > 1, we get ratio-polynomial signal functions that have also been successfully
used [Kosko 1992]. For α = 1, B = 0 and A = C = D = 1 this function equals the expression
x/(1 + x), which was analyzed in [Munro 1986]. So this Theorem gives a list of possible optimal
non-linear neurons that generalizes Cauchy and ratio-polynomial functions.

2. By comparing the results of Theorems 1 and 2 one can conclude that a scale-invariant
criterion cannot be shift-invariant: indeed, if a criterion is scale-invariant, we can apply
Theorem 2, so f must be described by the above expression. But these functions are different
from the functions from Theorem 1, and so due to Theorem 1 this criterion is not shift-invariant.

But what if we still want our criterion to be both shift- and scale-invariant? For standard
neurons with non-linearity of the type y = f(w_1 x_1 + ... + w_n x_n) it is impossible; in Section 4.3
we'll show that it is possible for a more general type of neurons.
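The special cases mentioned in the comments to Theorem 2 are easy to check numerically. The following sketch evaluates the general expression from Theorem 2 for y > 0 and compares it with the Cauchy function 1/(1 + y^2) and with the expression x/(1 + x). The function names, the parameter name `alpha` for the exponent, and the sample points are illustrative assumptions, not part of the paper.

```python
def theorem2_family(y: float, A: float, B: float, C: float, D: float, alpha: float) -> float:
    """Evaluate (A + B*y**(-alpha)) / (C + D*y**(-alpha)) for y > 0."""
    u = y ** (-alpha)
    return (A + B * u) / (C + D * u)


def cauchy(y: float) -> float:
    """Cauchy function 1 / (1 + y^2)."""
    return 1.0 / (1.0 + y * y)


def ratio_x_over_one_plus_x(y: float) -> float:
    """The expression x / (1 + x) mentioned in the comments."""
    return y / (1.0 + y)


points = [0.25, 0.5, 1.0, 2.0, 10.0]  # arbitrary positive sample points

# A = 0, B = C = D = 1, alpha = 2 should reproduce the Cauchy function.
cauchy_gap = max(abs(theorem2_family(y, 0, 1, 1, 1, 2) - cauchy(y)) for y in points)

# alpha = 1, B = 0, A = C = D = 1 should reproduce x / (1 + x).
ratio_gap = max(
    abs(theorem2_family(y, 1, 0, 1, 1, 1) - ratio_x_over_one_plus_x(y)) for y in points
)

print(cauchy_gap, ratio_gap)
```

The check is restricted to y > 0 so that the power y**(-alpha) is always well defined in floating point.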

4.2. More general families of neurons. A natural way to define a finite-dimensional family
of functions is to fix finitely many functions f_i(y) and consider their arbitrary linear combinations
Σ_i C_i f_i(y).

Definition. Let's fix an integer m. By a basis we mean a set of m smooth functions
f_i(y), i = 1, 2, ..., m. By an m-dimensional family of functions we mean all functions of the
type f(y) = Σ_{i=1}^{m} C_i f_i(y) for some basis f_i(y), where C_i are arbitrary constants. The set of all
m-dimensional families will be denoted by F_m.

Our definitions of the optimality criterion, final criterion and shift-invariant criterion can be
applied to these families.

THEOREM 3. If an m-dimensional family F is optimal in the sense of some optimality criterion
that is final and shift-invariant, then every function f from F is a linear combination of
the functions of the type y^p exp(αy) sin(βy + φ), where p is a non-negative integer, and α, β, φ
are real numbers.

Comment. In particular, for p = 1, α = β = 0 we get linear functions; for p = β = 0 we get
exponential functions; for p = α = φ = 0 we get a sine function, that has been successfully used
[Braham 1989, Braham Hamblen 1990].
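To make the shift-invariance of such families concrete, here is a small sketch with a two-dimensional family spanned by exp(αy) sin(βy) and exp(αy) cos(βy) (the cosine is the sine with φ = π/2). Shifting either basis function by a gives a linear combination of the same two functions, with coefficients that depend only on a. The function names and the sample values are assumptions made for illustration.

```python
import math


def basis(y: float, alpha: float, beta: float):
    """Basis functions exp(alpha*y)*sin(beta*y) and exp(alpha*y)*cos(beta*y)."""
    e = math.exp(alpha * y)
    return [e * math.sin(beta * y), e * math.cos(beta * y)]


def shift_matrix(a: float, alpha: float, beta: float):
    """Coefficients C[i][j] with f_i(y + a) = sum_j C[i][j] * f_j(y)."""
    e = math.exp(alpha * a)
    c, s = math.cos(beta * a), math.sin(beta * a)
    return [[e * c, e * s], [-e * s, e * c]]


def shift_error(y: float, a: float, alpha: float, beta: float) -> float:
    """Largest gap between the shifted basis and its linear-combination form."""
    shifted = basis(y + a, alpha, beta)
    f = basis(y, alpha, beta)
    C = shift_matrix(a, alpha, beta)
    combos = [sum(C[i][j] * f[j] for j in range(2)) for i in range(2)]
    return max(abs(shifted[i] - combos[i]) for i in range(2))


# Arbitrary sample values for y, the shift a, alpha and beta.
print(shift_error(y=1.2, a=-0.7, alpha=0.3, beta=2.0))
print(shift_error(y=-0.4, a=1.5, alpha=0.0, beta=1.0))
```

With alpha = 0 and beta = 1 this reduces to the familiar addition formulas for sine and cosine.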

4.3. [...] function of p variables. As in the above case, linear transformations are easy to
implement, therefore we can consider neurons of the type

y = f(w_11 x_1 + ... + w_1n x_n, w_21 x_1 + ... + w_2n x_n, ..., w_p1 x_1 + ... + w_pn x_n),

where w_ij are weights.

Definition. Let's fix integers m and p. By a basis we mean a set of m smooth functions
f_i(y_1, ..., y_p), i = 1, 2, ..., m. By an m-dimensional family of functions of p variables, or
m,p-family for short, we mean a family that is formed by all functions of the type
f(y_1, ..., y_p) = Σ_i C_i f_i(y_1, ..., y_p) for some basis f_i(y_1, ..., y_p), where C_i are arbitrary
constants. The set of all m,p-dimensional families will be denoted by F_{m,p}.

Comment. Since it is easy to implement arbitrary linear transformations, it is reasonable to
demand (like we did above) that the relative quality of the family does not change if we apply a
shift or a rescaling to all its functions. So we arrive at the following definitions.

Definitions. Suppose that a vector a = (a_1, ..., a_p) is given. By the result of adding a to
a function f(y_1, ..., y_p) we mean a function f_a(y_1, ..., y_p) = f(y_1 + a_1, ..., y_p + a_p). By the
result of adding a to a family F we mean the set of the functions that are obtained from f ∈ F by
adding a. This result will be denoted by F + a.

By the result of rescaling a function f(y_1, ..., y_p) by a vector a = (a_1, ..., a_p) with a_i > 0 we
mean a function f_a(y_1, ..., y_p) = f(a_1 y_1, ..., a_p y_p). By the result of rescaling a family F by a we
mean the set of the functions that are obtained from rescaling f ∈ F by a. This result will be
denoted by aF.

Now we can apply the definitions of the optimality criterion, final criterion, shift- and scale- 
invariant criterion to m-dimensional families. 

THEOREM 4. If an m,p-dimensional family F is optimal in the sense of some optimality
criterion that is final, shift- and scale-invariant, then every function f from F is equal to a
polynomial of y_1, ..., y_p.

Comment. So if we consider neurons with a more complicated non-linearity, we get the GMDH
network (see [Hecht-Nielsen 1991, Section 5.6.1]), which historically is the first commercially
successful non-linear neuron.

4.4. A related question: how to modify the weights during the training? This question
is related to the following problem: usually during training weights are changed linearly
(w → w + a for some constant a). However, sometimes some weights become so big that the
output of the corresponding neurons is close to 1. These neurons are called saturated (see, e.g.,
[Wasserman 1989, pp. 90-91]). Saturation extends the training time, bringing the whole training
process into a paralysis. To overcome paralysis, non-linear weight transformations are used:
w → g(w) for some non-linear g. The main purpose of this transformation is to get rid of big
values, i.e., to transform the whole set of real numbers (-∞, ∞) into some limited interval of
values [-A, A]. But there are many functions with this property. So the natural question arises:
what g is the best to choose?

The same arguments as in Sections 2 and 3 can be used to conclude that what we really 
need to choose is a family of functions, not just a function g , and that it is reasonable to assume 
that this family is obtained from some function g(w) by applying all the transformations from 
some appropriate transformation group. 

If a function g is better than a function g̃, then it is reasonable to assume that it will still
be better if we first make one more standard training step w → w + a and only then apply
the non-linear transformation. These two consecutive steps are equivalent to one transformation
w → g(w + a). So the above demand means that if g(w) is better than g̃(w), then g_1(w) = g(w + a)
must be better than g̃_1(w) = g̃(w + a). In other words, the optimality criterion must be
shift-invariant. Let's turn to formal definitions.

Definitions. Let us fix some real number A > 0. We say that a function is bounded if
its values always belong to [-A, A]. By a family of weight transformations we mean the set of
functions that is obtained from a smooth (everywhere defined) non-constant bounded function
f(w) by applying all the transformations from some appropriate transformation group G. Let
us denote the set of all such families by F_w. Let us consider optimality criteria on this set F_w.

THEOREM 5. If a family F ∈ F_w is optimal in the sense of some optimality criterion that is final
and shift-invariant, then every bounded function g from F is equal to a + b s_0(Ky + l) for some
a, b, K and l.

Comment. Such functions really proved to be the best in overcoming paralysis [Wasserman 1989,
pp. 90-91].
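A minimal sketch of such a bounded weight transformation is given below. It uses one particular member of the logistic family, g(w) = -A + 2A s_0(Kw), i.e., the expression a + b s_0(Kw + l) with a = -A, b = 2A and l = 0. This choice of constants, the function names and the sample weights are assumptions made for illustration, not a procedure prescribed by the paper.

```python
import math


def s0(y: float) -> float:
    """Logistic function, written so that exp never overflows for large |y|."""
    if y >= 0:
        return 1.0 / (1.0 + math.exp(-y))
    e = math.exp(y)
    return e / (1.0 + e)


def squash(w: float, A: float = 1.0, K: float = 1.0) -> float:
    """Map any real weight w into [-A, A] via g(w) = -A + 2*A*s0(K*w)."""
    return -A + 2.0 * A * s0(K * w)


# Arbitrary sample weights, including very large ones that would saturate a neuron.
weights = [-1e6, -3.0, 0.0, 0.5, 40.0, 1e6]
squashed = [squash(w, A=5.0, K=0.1) for w in weights]
print(squashed)
```

The transformation keeps the ordering of the weights and leaves w = 0 at 0, while arbitrarily large weights are pulled back into the interval [-A, A].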

5. PROOFS. 

Proof of Theorem 1. The idea of this proof is as follows: first we prove that the appropriate
transformation group consists of fractionally-linear functions (in part 1), then we prove that
the optimal family is shift-invariant (in part 2), and from that in part 3 we conclude that any
function f from F satisfies some functional equations, whose solutions are known.

1. By an appropriate group we meant a connected finite-dimensional Lie group of
transformations of the set of real numbers R onto itself that contains all linear transformations. Norbert
Wiener asked to classify such groups for an n-dimensional space with arbitrary n, and this
classification was obtained in [Guillemin Sternberg 1964] and [Singer Sternberg 1965]. In our case
(when n = 1) the only possible groups are the group of all linear transformations and the group
of all fractionally-linear transformations x → (ax + b)/(cx + d). In both cases the group
consists only of fractionally linear transformations (the simplified proof for the 1-dimensional case
is given in [Kreinovich 1987]; for other applications of this result see [Kreinovich Kumar 1990,
1991], [Kreinovich Corbin 1991], [Kreinovich Quintana 1991]).

2. Let us now prove that the optimal family F_opt exists and is shift-invariant in the sense
that F_opt = F_opt + a for all real numbers a. Indeed, we assumed that the optimality criterion
is final, therefore there exists a unique optimal family F_opt. Let's now prove that this optimal
family is shift-invariant (this proof is practically the same as in [Kreinovich 1990]). The fact that
F_opt is optimal means that for every other F, either F_opt > F or F_opt ~ F. If F_opt ~ F for some
F ≠ F_opt, then from the definition of the optimality criterion we can easily deduce that F is also
optimal, which contradicts the fact that there is only one optimal family. So for every F either
F_opt > F or F_opt = F.

Take an arbitrary a and let F = F_opt + a. If F_opt > F = F_opt + a, then from the invariance
of the optimality criterion we conclude that F_opt + (-a) > F_opt, which contradicts the
optimality of F_opt. So F_opt > F_opt + a is impossible, and therefore F_opt = F_opt + a, i.e., the
optimal family is shift-invariant.

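3. Let us now deduce the actual form of the functions f from the optimal family. If f(y)
belongs to F, then, because F is shift-invariant, f(y + a) also belongs to F. But all the functions
from F can be obtained from f by applying fractionally linear transformations, so
f(y + a) = (A + B f(y))/(C + D f(y)) for some A, B, C, D that depend on a. Fractionally linear
transformations preserve the cross-ratio
[Aczel 1966, Section 2.3], i.e., if g(y) = (A + B f(y))/(C + D f(y)), then

$$\frac{g(y_1) - g(y_3)}{g(y_2) - g(y_3)} \cdot \frac{g(y_2) - g(y_4)}{g(y_1) - g(y_4)} = \frac{f(y_1) - f(y_3)}{f(y_2) - f(y_3)} \cdot \frac{f(y_2) - f(y_4)}{f(y_1) - f(y_4)}$$

for all y_i. In our case this is true for g(y) = f(y + a), therefore for all a the following equality is
true:

$$\frac{f(y_1 + a) - f(y_3 + a)}{f(y_2 + a) - f(y_3 + a)} \cdot \frac{f(y_2 + a) - f(y_4 + a)}{f(y_1 + a) - f(y_4 + a)} = \frac{f(y_1) - f(y_3)}{f(y_2) - f(y_3)} \cdot \frac{f(y_2) - f(y_4)}{f(y_1) - f(y_4)}$$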

The most general continuous solutions of this functional equation are given in
[Aczel 1966]: either f is fractionally linear, or f(y) = (a + b tan(ky))/(c + d tan(ky)) for
some a, b, c, d and k, or f(y) = (a + b tanh(ky))/(c + d tanh(ky)), where tanh(z) = sinh(z)/cosh(z),
sinh(z) = (exp(z) - exp(-z))/2 and cosh(z) = (exp(z) + exp(-z))/2.
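The invariance of the cross-ratio under shifts can be checked numerically for a function of the hyperbolic-tangent type, and contrasted with a function outside the list (here y^3, chosen only as a counterexample). The cross-ratio below is the product of the two difference ratios taken at four points; the constants a, b, c, d, k, the four points and the shift are arbitrary values picked for this sketch.

```python
import math


def cross_ratio(f, y1: float, y2: float, y3: float, y4: float) -> float:
    """Cross-ratio of the values of f at four points."""
    return ((f(y1) - f(y3)) * (f(y2) - f(y4))) / ((f(y2) - f(y3)) * (f(y1) - f(y4)))


def tanh_type(y: float) -> float:
    """(a + b*tanh(k*y)) / (c + d*tanh(k*y)) with a=1, b=2, c=3, d=1, k=0.5."""
    t = math.tanh(0.5 * y)
    return (1.0 + 2.0 * t) / (3.0 + 1.0 * t)


def cube(y: float) -> float:
    """A function that is not in the list of solutions."""
    return y ** 3


def shifted(f, a: float):
    """Return the function y -> f(y + a)."""
    return lambda y: f(y + a)


pts = (0.0, 1.0, 2.0, 3.0)
shift = 0.7

tanh_gap = abs(cross_ratio(tanh_type, *pts) - cross_ratio(shifted(tanh_type, shift), *pts))
cube_gap = abs(cross_ratio(cube, *pts) - cross_ratio(shifted(cube, shift), *pts))
print(tanh_gap, cube_gap)
```

For the hyperbolic-tangent expression the gap is at rounding level, while for the cube the cross-ratio visibly changes under the shift.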

If f(y) is fractionally linear, f(y) = (a + by)/(c + dy), and d ≠ 0, then the denominator is
equal to zero for y = -c/d. The only way for the function to be defined for this y is that the
numerator is also equal to zero there, i.e., that a = bc/d. In this case
f(y) = (b/d)(c + dy)/(c + dy), and this function f(y) is always equal to a constant
b/d. But we assumed that f is not a constant. So d = 0 and f is linear.

Let us prove that the expressions with tangent are impossible. Indeed, if d ≠ 0, then for
ky = arctan(-c/d) the denominator is equal to zero, so either f is not defined at this point, or,
as above, f is constant, and that contradicts our
assumption. So d = 0 and f(y) = (a/c) + (b/c) tan(ky). Hence: either b = 0
and f = const, or b ≠ 0, and f is not defined when tan(ky) = ∞, e.g., when
y = π/(2k). So expressions with tangent are really impossible.

Let us now consider the case of hyperbolic tangent. If k = 0, then tanh(ky) = 0, so f is
constant, which is impossible. So k ≠ 0. If k < 0, then we can take k' = -k and use the fact that
tanh is an odd function: tanh(-z) = -tanh(z). Therefore, in the following we can assume that
k > 0. Multiplying both the denominator and the numerator by cosh(ky), we conclude that
f(y) = (a cosh(ky) + b sinh(ky))/(c cosh(ky) + d sinh(ky)). We then substitute the formulas for
sinh and cosh in terms of exp, and conclude that f(y) = (A exp(ky) + B exp(-ky))/(C exp(ky) +
D exp(-ky)) for some A, B, C, D. Multiplying both denominator and numerator by exp(-ky),
we conclude that f(y) = (A + B exp(-2ky))/(C + D exp(-2ky)). If C = 0 or D = 0, this expression
reduces to the exponential form a + b exp(Kx) from the formulation of the theorem. So it remains
to consider the case when both C and D are different from 0.

If C and D have different signs, then for exp(2ky) = -D/C the denominator equals
zero, and so, just like in the tangent case, we conclude that f is either identically constant
or not defined in this point y = ln(-D/C)/(2k). If C and D have the same signs, then for
l = -ln(D/C) we have C + D exp(-2ky) = C(1 + (D/C) exp(-2ky)) = C(1 + exp(-(2ky + l))).
If we substitute exp(-2ky) = exp(-(2ky + l)) exp(l) = (C/D) exp(-(2ky + l)) into the numerator,
we get A + (BC/D) exp(-(2ky + l)), and therefore f(y) = (A + (BC/D) exp(-(2ky + l)))/(C(1 +
exp(-(2ky + l)))). One can check (by substituting the expression of the logistic function s_0 in
terms of exp) that this expression is equal to (B/D) + (A/C - B/D) s_0(2ky + l). So we get the
desired expression for K = 2k. Q.E.D.
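The last step of this proof can be checked numerically. Assuming that C and D have the same sign and taking l = -ln(D/C), the ratio (A + B exp(-2ky))/(C + D exp(-2ky)) should coincide with the logistic expression (B/D) + (A/C - B/D) s_0(2ky + l), which can be verified by substituting s_0(2ky + l) = C/(C + D exp(-2ky)). The constants and the grid of points below are arbitrary values chosen for this sketch.

```python
import math


def s0(y: float) -> float:
    """Logistic function s0(y) = 1 / (1 + exp(-y))."""
    return 1.0 / (1.0 + math.exp(-y))


def ratio_form(y: float, A: float, B: float, C: float, D: float, k: float) -> float:
    """(A + B*exp(-2ky)) / (C + D*exp(-2ky))."""
    e = math.exp(-2.0 * k * y)
    return (A + B * e) / (C + D * e)


def logistic_form(y: float, A: float, B: float, C: float, D: float, k: float) -> float:
    """B/D + (A/C - B/D) * s0(2ky + l) with l = -ln(D/C); needs C and D of the same sign."""
    l = -math.log(D / C)
    return B / D + (A / C - B / D) * s0(2.0 * k * y + l)


A, B, C, D, k = 1.0, -2.0, 3.0, 0.5, 0.7  # arbitrary constants, C and D both positive
ys = [x / 10.0 for x in range(-30, 31)]
max_gap = max(abs(ratio_form(y, A, B, C, D, k) - logistic_form(y, A, B, C, D, k)) for y in ys)
print(max_gap)
```

The two limits also agree with this form: for y → +∞ both expressions tend to A/C, and for y → -∞ both tend to B/D.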

Proof of Theorem 2. Just like in the proof of Theorem 1, we conclude that f(ay) = (A +
B f(y))/(C + D f(y)). This functional equation is almost the same as the one we solved in Theorem
1, with the only exception being that here we have a product instead of a sum. It is well known
that if we turn to logarithms, then the logarithm of a product is equal to the sum of logarithms.
So in order to reduce this case to the one already analyzed, let us introduce the new variable
Y = ln y (so that y = exp(Y)), and a new function F(Y) = f(exp(Y)). For this function, the
above functional equation takes the following form: for every E there exist A, B, C, D such that
F(Y + E) = (A + B F(Y))/(C + D F(Y)) (E = ln a). In the proof of Theorem 1, we have already
enumerated the solutions of this equation, so F(Y) is either a fractionally-linear function, or a
fractionally-linear transformation of tan(kY) or tanh(kY). If we know F, then using the equation
F(Y) = f(exp(Y)), we can reconstruct f(y) = F(ln y) for y > 0. Similar expressions can be
obtained for y < 0: in this case we need to use Y = ln |y|. So in order to complete the proof, we
must substitute ln y into the expressions enumerated above, and choose those that are defined
everywhere.

Let us first consider the case when F(Y) = (a + bY)/(c + dY). Substituting ln y instead of
Y, we get f(y) = (a + b ln y)/(c + d ln y). This function must be smooth. Let us compute the
derivative f': f'(y) = ((b/y)(c + d ln y) - (d/y)(a + b ln y))/(c + d ln y)^2 =
(bc - ad)/(y (c + d ln y)^2). If bc - ad = 0, then
as in Theorem 1, we can conclude that f is identically constant. If bc - ad ≠ 0, then for y → 0
this derivative tends to ∞, so such functions f are not smooth at 0.

The fact that the expressions with tangent are impossible is proved just like in Theorem
1. So the only remaining case is the case of tanh, in which the function F(Y) can be reduced
to F(Y) = (A + B exp(-2kY))/(C + D exp(-2kY)). Substituting Y = ln y, and using the fact
that exp(-2k ln y) = (exp(ln y))^{-2k} = y^{-2k}, we conclude that f(y) = (A + B y^{-α})/(C + D y^{-α}),
where α = 2k. Q.E.D.

Proof of Theorem 3. As in the proof of Theorem 1, we come to a conclusion that the optimal
family F_opt exists and is shift-invariant. In particular, for every i the result f_i(y + a) of shifting
f_i(y) must belong to the same family, i.e., f_i(y + a) = C_i1(a) f_1(y) + C_i2(a) f_2(y) + ... + C_im(a) f_m(y)
for some constants C_ij, depending on a. Let us first prove that these functions C_ij(a) are
differentiable. Indeed, if we take m different values y_k, 1 ≤ k ≤ m, we get m linear equations for
C_ij(a):

f_i(y_k + a) = C_i1(a) f_1(y_k) + C_i2(a) f_2(y_k) + ... + C_im(a) f_m(y_k),

from which we can determine C_ij using Cramer's rule. Cramer's rule expresses every unknown as
a fraction of two determinants, and these determinants polynomially depend on the coefficients.
The coefficients either do not depend on a at all (f_j(y_k)) or depend smoothly (f_i(y_k + a)) because
f_i are smooth functions. Therefore these polynomials are also smooth functions, and so is their
fraction C_ij(a).

We have an explicit expression for f_i(y + a) in terms of f_j(y) and C_ij. So, when a = 0,
the derivative of f_i(y + a) with respect to a equals the derivative of the expression. If we
differentiate it, we get the following formula: f_i'(y) = c_i1 f_1(y) + c_i2 f_2(y) + ... + c_im f_m(y), where
c_ij = C_ij'(0). So the set of functions f_i(y) satisfies the system of linear differential equations with
constant coefficients. The general solution of such a system is well known [Bellman 1970], so we
get the desired expressions. Q.E.D.

Proof of Theorem 4. [...] by a also belong to this optimal family F.

For shift-invariance this means that

f_i(y_1 + a_1, ..., y_p + a_p) = C_i1(a) f_1(y_1, ..., y_p) + ... + C_im(a) f_m(y_1, ..., y_p).

In particular, if we take a_2 = ... = a_p = 0, we conclude that

f_i(y_1 + a_1, y_2, ..., y_p) = C_i1(a_1) f_1(y_1, y_2, ..., y_p) + C_i2(a_1) f_2(y_1, ..., y_p) + ... + C_im(a_1) f_m(y_1, ..., y_p).

If we fix some values y_2, ..., y_p and denote g_i(y_1) = f_i(y_1, y_2, ..., y_p), we get the equation
g_i(y_1 + a_1) = C_i1(a_1) g_1(y_1) + ... + C_im(a_1) g_m(y_1) that we
have already solved in the proof of Theorem 3, and therefore each of g_i is equal to the linear
combination of the functions y_1^p exp(α y_1) sin(β y_1 + φ) from the formulation of Theorem 3.

Likewise scale-invariance means that

f_i(a_1 y_1, ..., a_p y_p) = C_i1(a) f_1(y_1, ..., y_p) + C_i2(a) f_2(y_1, ..., y_p) + ... + C_im(a) f_m(y_1, ..., y_p).

If we take a_2 = ... = a_p = 1, we conclude that g_i(a_1 y_1) = C_i1(a_1) g_1(y_1) + ... + C_im(a_1) g_m(y_1).
This functional equation is almost the same as for [...]; the only difference is that here we have
a product instead of a sum. We already [...] when we proved Theorem 2, and so
we know what trick to apply: we must introduce a new variable Y = ln y_1 (so that y_1 = exp(Y)),
and new functions G_i(Y) = g_i(exp(Y)) (so that g_i(y_1) = G_i(ln y_1)). Then for these new functions
this functional equation takes the form G_i(Y + A) = C_i1(A) G_1(Y) + ... + C_im(A) G_m(Y). This is
[...] how to solve. So we can conclude that G_i(Y) is a linear combination of the functions from
Theorem 3. When we substitute Y = ln y_1, we conclude that g_i(y_1) is a linear combination
of the functions (ln y_1)^p exp(α ln y_1) sin(β ln y_1 + φ). [...] complicated.
The only simplification that we can apply is to change exp(α ln y_1) to y_1^α. So we
conclude that g_i is a linear combination of the functions (ln y_1)^p y_1^α sin(β ln y_1 + φ).

So for the same functions g_i we have two different expressions [...]
of shift-invariance and scale-invariance. [...]
belong to both classes? If [...] cannot be a linear combination
of the functions from Theorem 3, because there are no logarithms among them. The
[...] when a linear combination of the
functions y_1^α [...] the same [...] linear combination of the functions
y_1^p exp(α y_1) sin(β y_1 + φ) is when α = β = 0. In this case the above expression turns into y_1^p.
[...] conclude that α = p. But p is necessarily a
non-negative integer. So the dependency of f_i on y_1 is polynomial.

Similarly we can conclude that f_i [...] y_2, ..., y_p,
and therefore all the functions f_i are polynomials of y_1, ..., y_p. Any function from F is a linear
combination of these polynomials, and therefore a polynomial itself. Q.E.D.

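Proof of Theorem 5. We can repeat the proof of Theorem 1 and come to the conclusion that
every function from the optimal family is logistic, linear, or exponential. Linear and exponential
functions are not bounded, so only the logistic function is left. Q.E.D.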